Abstract. We give the first sorting algorithm with bounds in terms of
higher-order entropies: let $S$ be a sequence of length $m$ containing $n$
distinct elements and let $H_\ell(S)$ be the $\ell$th-order empirical entropy of $S$,
with $n^{\ell+1} \log n \in O(m)$; our algorithm sorts $S$ using $(H_\ell(S) + O(1))m$
comparisons.

1 Introduction 

Sorting in the comparison model is one of the oldest problems in computer science,
but it remains an important and active area. Previous research has shown how 
we can take advantage of various kinds of pre-sortedness, such as long runs, 
few inversions, or only a small number of elements out of place (see [10]); in this
paper, we show how we can take advantage of low entropy to reduce comparisons. 

Consider a fixed sequence $S = s_1, \ldots, s_m$ containing $n$ distinct elements
drawn from a total order. For any non-negative integer $\ell$, the $\ell$th-order empirical
entropy of $S$, denoted $H_\ell(S)$, is our expected uncertainty about $s_i$ (measured in
bits) given a context of length $\ell$, as in the following experiment: we are given
$S$; $i$ is chosen uniformly at random from $\{1, \ldots, m\}$; if $i \le \ell$, we are told $s_i$; if
$i > \ell$, we are told $s_{i-\ell}, \ldots, s_{i-1}$. Specifically,

\[
H_\ell(S) =
\begin{cases}
\displaystyle \sum_{a \in S} \frac{\#_a(S)}{m} \log \frac{m}{\#_a(S)} & \text{if } \ell = 0, \\[2ex]
\displaystyle \frac{1}{m} \sum_{\alpha \in A_\ell} |S_\alpha| \, H_0(S_\alpha) & \text{if } \ell > 0.
\end{cases}
\]

Here, $a \in S$ means $a$ occurs in $S$; $\#_a(S)$ is the number of occurrences of $a$ in
$S$; $\log$ means $\log_2$; $A_\ell$ is the set of $\ell$-tuples in $S$; and $S_\alpha$ is the sequence whose
$i$th element is the one immediately following the $i$th occurrence of $\alpha$ in $S$. The
length of $S_\alpha$ is the number of occurrences of $\alpha$ in $S$ unless $\alpha$ is a suffix of $S$, in
which case it is 1 less.
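The definition can be checked numerically. The following is a minimal sketch, not part of the algorithm itself; the function names `h0` and `empirical_entropy` are hypothetical, and contexts are represented as Python tuples.

```python
from collections import Counter, defaultdict
from math import log2


def h0(seq):
    """0th-order empirical entropy of seq, in bits per element."""
    m = len(seq)
    if m == 0:
        return 0.0
    return sum(c / m * log2(m / c) for c in Counter(seq).values())


def empirical_entropy(seq, order):
    """order-th empirical entropy; a context is the `order` preceding elements."""
    if order == 0:
        return h0(seq)
    followers = defaultdict(list)  # context tuple alpha -> the sequence S_alpha
    for i in range(order, len(seq)):
        followers[tuple(seq[i - order:i])].append(seq[i])
    # A context that only occurs as a suffix has no followers and contributes 0.
    return sum(len(f) * h0(f) for f in followers.values()) / len(seq)


print(round(empirical_entropy("TORONTO", 0), 2))  # 1.84
print(round(empirical_entropy("TORONTO", 1), 2))  # 0.29
print(empirical_entropy("TORONTO", 2))            # 0.0
```

The string TORONTO is the example the text works through by hand next.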



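Notice $\log n \ge H_0(S) \ge \cdots \ge H_{m-1}(S) = H_m(S) = \cdots = 0$. For example, if
$S$ is the string TORONTO, then $\log n = 2$,

\[
H_0(S) = \frac{1}{7} \log 7 + \frac{3}{7} \log \frac{7}{3} + \frac{1}{7} \log 7 + \frac{2}{7} \log \frac{7}{2} \approx 1.84 ,
\]

\begin{align*}
H_1(S) &= \frac{1}{7} \left( H_0(S_{\mathrm{N}}) + 2 H_0(S_{\mathrm{O}}) + H_0(S_{\mathrm{R}}) + 2 H_0(S_{\mathrm{T}}) \right) \\
&= \frac{1}{7} \left( H_0(\mathrm{T}) + 2 H_0(\mathrm{RN}) + H_0(\mathrm{O}) + 2 H_0(\mathrm{OO}) \right) \\
&= 2/7 \approx 0.29
\end{align*}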

and all higher-order empirical entropies of S are 0. This means, if someone 
chooses a character uniformly at random from TORONTO and asks us to guess 
it, then our uncertainty is about 1.84 bits. If they tell us the preceding charac- 
ter before we guess, then on average our uncertainty is about 0.29 bits; if they 
tell us the preceding two or more characters, then we are certain of the answer. 
The difference between 0th-order and higher-order empirical entropies can be of
practical importance: the encodings produced by most older compression algorithms
are only bounded in terms of the 0th-order empirical entropy of the input,
whereas those produced by most modern compression algorithms are bounded
in terms of higher-order empirical entropies. For example, Manzini [7] proved
Burrows and Wheeler's algorithm [5] encodes $S$ using at most

\[ (8 H_\ell(S) + O(1))m + N^\ell (2N \log N + 9) \]

bits, where $\ell$ is any non-negative integer, $N$ is the size of the alphabet and,
depending on the implementation, the hidden constant is about 2/25.

Suppose we want to sort $S$, that is, to put the elements of $S$ in non-decreasing
order. Many familiar sorting algorithms already take advantage of low 0th-order
empirical entropy: Munro and Spira [§] proved MergeSort, TreeSort and
HeapSort use $(H_0(S) + O(1))m$ ternary comparisons¹; by the Static Optimality
Theorem ^J, SplaySort uses $O((H_0(S) + 1)m)$ comparisons; Sedgewick and
Bentley [12] recently proved Quicksort uses $O((H_0(S) + 1)m)$ comparisons in
the expected case.

In Section 2 we give a new algorithm that sorts $S$ using $(H_0(S) + O(1))m$
comparisons. In Section 3 we generalize it so that, given a non-negative integer $\ell$
with $n^{\ell+1} \log n \in O(m)$, it uses $(H_\ell(S) + O(1))m$ comparisons. Our algorithm's
main disadvantage is its slowness: it takes $O((H_\ell(S) + 1) m \log n + \ell m)$ time,
whereas the algorithms mentioned above take $O((H_0(S) + 1)m)$ time. It works in
models where, for $t \le m$, it takes $O(\log t)$ time to perform a standard operation
on a balanced binary search tree with $t$ keys, each of $O(\log m)$ bits [2]; if such a
tree takes $O(t)$ space, then our algorithm takes $O(m)$ space. We emphasize that
we do not make assumptions about the source of $S$, nor do we use randomization
or pointer arithmetic.



¹ A ternary comparison of $x$ and $y$ tells us whether $x < y$, $x = y$ or $x > y$; a binary
comparison only tells us whether $x \le y$ or $x > y$. Our algorithm uses binary
comparisons, which is a slight advantage: while most instruction sets support ternary
comparisons, most high-level languages do not; a ternary comparison is usually
implemented as two binary comparisons [Q.

2 Sorting S using $(H_0(S) + O(1))m$ Comparisons

If we are given a list of the distinct elements in $S$ and their frequencies, then we
can easily sort $S$ using fewer than $(H_0(S) + 2)m$ comparisons: we construct a
nearly optimal leaf-oriented binary search tree $T$, as described in Subsection 2.1,
and perform an insertion sort into $T$. A leaf-oriented binary search tree (LBST)
is one in which the data are stored at the leaves.

Since we are not given that information, we instead start with an LBST $T_1$
on $s_1$; for $i$ from 2 to $m$, we search for $s_i$ in $T_{i-1}$ and then "in effect" construct a
new LBST $T_i$ which is nearly optimal for $s_1, \ldots, s_i$. In Subsection 2.2 we prove
this uses $(H_0(S) + O(1))m$ comparisons. Of course, actually constructing every
$T_i$ would be very slow; in Subsection 2.3 we show how we can quickly "in effect"
construct them. We used a similar approach in [5] for dynamic alphabetic coding.



2.1 Constructing a Nearly Optimal Leaf-Oriented Binary Search 
Tree 

Let $a_1, \ldots, a_n$ be the distinct elements in $S$ in increasing order. By Shannon's
Noiseless Coding Theorem, if we search for $s_1, \ldots, s_m$ in an LBST on
$a_1, \ldots, a_n$, then we use at least $H_0(S) m$ comparisons. Mehlhorn [H] gave an
$O(n)$-time algorithm that, given $a_1, \ldots, a_n$ and $\#_{a_1}(S), \ldots, \#_{a_n}(S)$, constructs
an LBST with which we use fewer than $(H_0(S) + 2)m$ comparisons; we follow
Knuth's jj presentation.

Theorem 1 (Mehlhorn, 1977). We can construct a leaf-oriented binary search
tree on $a_1, \ldots, a_n$ whose leaves have depths

\[
\left\lceil \log \frac{m}{\#_{a_1}(S)} \right\rceil + 1, \; \ldots, \; \left\lceil \log \frac{m}{\#_{a_n}(S)} \right\rceil + 1 .
\]

Proof. For $1 \le i \le n$, let

\[
f_i = \sum_{i'=1}^{i-1} \frac{\#_{a_{i'}}(S)}{m} + \frac{\#_{a_i}(S)}{2m} .
\]

Since $|f_i - f_{i'}| \ge \#_{a_i}(S) / (2m)$ for $i' \ne i$, the first
$\lceil \log (m / \#_{a_i}(S)) \rceil + 1$ bits of $f_i$'s binary
representation suffice to distinguish it; let $\sigma_i$ be this sequence of bits. Notice
$\sigma_1, \ldots, \sigma_n$ are lexicographically increasing.

We construct a binary tree such that, for $1 \le i \le n$ and $1 \le k \le |\sigma_i|$, the
$k$th edge on the path from the root to the $i$th leaf is a left edge if the $k$th bit of
$\sigma_i$ is a 0, and a right edge if it is a 1. We store $a_1, \ldots, a_n$ at the leaves. At each
internal node $v$, if $v$ has two children, then we store a pointer to the rightmost
leaf in $v$'s left subtree. □
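To make the construction concrete, here is a small sketch that computes the bit strings $\sigma_1, \ldots, \sigma_n$ from the frequencies alone. It is an illustration under assumptions, not Mehlhorn's $O(n)$-time procedure: the names `ceil_log2_ratio` and `codewords` are hypothetical, and exact rationals are used to avoid rounding issues.

```python
from fractions import Fraction


def ceil_log2_ratio(m, c):
    """Smallest k >= 0 with c * 2**k >= m, i.e. ceil(log2(m / c)) for 1 <= c <= m."""
    k = 0
    while c << k < m:
        k += 1
    return k


def codewords(freqs):
    """freqs are the counts of a_1 < ... < a_n; returns sigma_1, ..., sigma_n as bit strings."""
    m = sum(freqs)
    words, before = [], 0
    for c in freqs:
        f = Fraction(2 * before + c, 2 * m)    # f_i = before/m + c/(2m)
        bits = []
        for _ in range(ceil_log2_ratio(m, c) + 1):
            f *= 2                             # shift out the next bit of f's binary expansion
            bit = int(f >= 1)
            bits.append(str(bit))
            f -= bit
        words.append("".join(bits))
        before += c
    return words


# Frequencies of N, O, R, T in TORONTO.
print(codewords([1, 3, 1, 2]))  # ['0001', '010', '1010', '110']
```

Reading each string as a root-to-leaf path (0 for left, 1 for right) gives the tree described in the proof; the strings come out lexicographically increasing and no string is a prefix of another.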



Consider the LBST this algorithm produces. When searching for $s_i$, we start
at the root and descend to the leaf that stores $s_i$, as follows: at each internal
node $v$, if $v$ has two children and the rightmost leaf in $v$'s left subtree stores
element $a$, then we compare $s_i$ with $a$ and proceed to $v$'s left child or right child
depending on whether $s_i \le a$; if $v$ has only one child, we proceed immediately
to that child. Searching for $s_1, \ldots, s_m$, we use a total of at most

\[
\sum_{i=1}^{n} \#_{a_i}(S) \left( \left\lceil \log \frac{m}{\#_{a_i}(S)} \right\rceil + 1 \right) < (H_0(S) + 2)m
\]

comparisons.
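As a quick sanity check of this inequality, the sketch below evaluates both sides for a given sequence. The function names are hypothetical, and the left side is the depth-based upper bound from the text, not a count of comparisons from an actual tree.

```python
from collections import Counter
from math import log2


def ceil_log2_ratio(m, c):
    """Smallest k >= 0 with c * 2**k >= m, i.e. ceil(log2(m / c))."""
    k = 0
    while c << k < m:
        k += 1
    return k


def static_search_cost(seq):
    """Sum over distinct a of #a * (ceil(log(m / #a)) + 1)."""
    m = len(seq)
    return sum(c * (ceil_log2_ratio(m, c) + 1) for c in Counter(seq).values())


def entropy_bound(seq):
    """(H_0(S) + 2) * m."""
    m = len(seq)
    h0 = sum(c / m * log2(m / c) for c in Counter(seq).values())
    return (h0 + 2) * m


print(static_search_cost("TORONTO"))        # 23
print(round(entropy_bound("TORONTO"), 1))   # 26.9
```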



2.2 Using a Sequence of Leaf-Oriented Binary Search Trees 

Let $F$ be the set of indices $i$ such that $s_i$ is the first occurrence of that element
in $S$; that is, $F = \{i : s_i \notin \{s_1, \ldots, s_{i-1}\}\}$. For $1 \le i \le m$, let $T_i$ be the nearly
optimal LBST Mehlhorn's algorithm constructs for $s_1, \ldots, s_i$, augmented so that,
for $a \in \{s_1, \ldots, s_i\}$, the leaf storing $a$ also stores a counter set to $\#_a(s_1, \ldots, s_i)$
and a list containing the indices of $a$'s occurrences in $s_1, \ldots, s_i$. Consider the
concatenation of the lists in $T_m$, as a permutation: its inverse sorts $S$.² Thus,
with regard to comparisons needed, constructing $T_m$ is equivalent to sorting $S$;
the following lemmas show $(H_0(S) + O(1))m$ comparisons suffice.
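The role of the index lists can be illustrated with a short sketch. It assumes a plain dictionary in place of the leaves of $T_m$ and uses Python's built-in `sorted` on the distinct keys as a stand-in for reading the leaves left to right; the function names are hypothetical.

```python
def index_lists(seq):
    """Map each distinct element to the increasing list of 1-based positions where it occurs."""
    lists = {}
    for i, x in enumerate(seq, start=1):
        lists.setdefault(x, []).append(i)
    return lists


def sort_by_lists(seq):
    lists = index_lists(seq)
    # Concatenate the lists in increasing key order.
    order = [i for key in sorted(lists) for i in lists[key]]
    # Position k of the output holds s_{order[k]}, so each index i moves to
    # the position where it appears in `order` (the inverse permutation).
    return order, [seq[i - 1] for i in order]


order, result = sort_by_lists("TORONTO")
print(order)            # [5, 2, 4, 7, 3, 1, 6]
print("".join(result))  # NOOORTT
```

Because each list is in increasing order, equal elements keep their relative order in the output.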

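Lemma 1. We can construct $T_m$ using at most

\[
\sum_{i \in F - \{1\}} \left( \lceil \log (i - 1) \rceil + 3 \right)
+ \sum_{i \notin F} \left( \left\lceil \log \frac{i - 1}{\#_{s_i}(s_1, \ldots, s_{i-1})} \right\rceil + 3 \right)
\]

comparisons.

Proof. By induction. We can construct $T_1$ without using any comparisons. For
$2 \le i \le m$, suppose we have $T_{i-1}$ and want to construct $T_i$. To do this, we first
search for $s_i$ in $T_{i-1}$.

If $s_i \in \{s_1, \ldots, s_{i-1}\}$, that is, $i \notin F$, then our search uses
$\left\lceil \log \frac{i - 1}{\#_{s_i}(s_1, \ldots, s_{i-1})} \right\rceil + 1$
comparisons and ends at the leaf storing $s_i$. Otherwise, our search uses at most
$\lceil \log (i - 1) \rceil + 1$ comparisons and ends at a leaf storing either $s_i$'s predecessor or
successor in $T_{i-1}$.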

Let $a$ be the element stored at the leaf $v$ where our search ends. We determine
whether $a$ is $s_i$'s predecessor, $s_i$ itself, or $s_i$'s successor by checking whether
$a \le s_i$ and whether $s_i \le a$. If $a$ is $s_i$'s predecessor, then we insert a new leaf
immediately to the right of $v$ that stores $s_i$, a counter set to 1 and a list
containing $i$; if $a = s_i$, we increment $v$'s counter and add $i$ to $v$'s list; if $a$ is $s_i$'s
successor, then we insert a new leaf immediately to the left of $v$ that stores $s_i$,
a counter set to 1 and a list containing $i$.
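The two extra comparisons in this step can be sketched as follows. The function `classify` and its `leq` parameter are hypothetical names; `leq` stands for a single binary comparison.

```python
def classify(a, s, leq=lambda x, y: x <= y):
    """Relate the leaf element a to the searched element s using two binary comparisons."""
    a_le_s = leq(a, s)
    s_le_a = leq(s, a)
    if a_le_s and s_le_a:
        return "equal"
    return "predecessor" if a_le_s else "successor"


calls = []


def counting_leq(x, y):
    calls.append((x, y))   # record every comparison made
    return x <= y


print(classify("O", "R", counting_leq))  # predecessor
print(len(calls))                        # 2
```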

Notice $T_{i-1}$'s leaves now contain the same information as $T_i$'s, which is
enough for us to construct $T_i$ without any further comparisons. In total, if $i \notin F$,
then we use $\left\lceil \log \frac{i - 1}{\#_{s_i}(s_1, \ldots, s_{i-1})} \right\rceil + 3$ comparisons to construct $T_i$; otherwise, we
use at most $\lceil \log (i - 1) \rceil + 3$ comparisons. □
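For a concrete sequence, the bound of Lemma 1 can be evaluated directly. This sketch only adds up the terms of the lemma; it does not build any tree, and the function names are hypothetical.

```python
from collections import Counter


def ceil_log2_ratio(num, den):
    """Smallest k >= 0 with den * 2**k >= num, i.e. ceil(log2(num / den))."""
    k = 0
    while den << k < num:
        k += 1
    return k


def lemma1_bound(seq):
    seen = Counter()   # seen[x] = number of occurrences of x among s_1, ..., s_{i-1}
    total = 0
    for i, x in enumerate(seq, start=1):
        if i > 1:
            if seen[x] == 0:   # i is in F: first occurrence of x
                total += ceil_log2_ratio(i - 1, 1) + 3
            else:              # i is not in F
                total += ceil_log2_ratio(i - 1, seen[x]) + 3
        seen[x] += 1
    return total


print(lemma1_bound("TORONTO"))  # 28
```

For TORONTO the terms for $i = 2, \ldots, 7$ are 3, 4, 5, 5, 6 and 5.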

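² In fact, if each list in $T_m$ is in increasing order, then their concatenation's inverse
stably sorts $S$; a stable sort preserves equal elements' relative order.

Lemma 2.

\[
\sum_{i \in F - \{1\}} \log (i - 1) + \sum_{i \notin F} \log \frac{i - 1}{\#_{s_i}(s_1, \ldots, s_{i-1})} \le (H_0(S) + O(1))m
\]

Proof. Let

\begin{align*}
C &= \sum_{i \in F - \{1\}} \log (i - 1) + \sum_{i \notin F} \log \frac{i - 1}{\#_{s_i}(s_1, \ldots, s_{i-1})} \\
&< \log(m!) - \sum_{i \notin F} \log \#_{s_i}(s_1, \ldots, s_{i-1}) .
\end{align*}

For $i \notin F$, if $s_i$ is the $j$th occurrence of $a$ in $S$, then $j \ge 2$ and
$\log \#_{s_i}(s_1, \ldots, s_{i-1}) = \log (j - 1)$. Thus,

\begin{align*}
C &< \log(m!) - \sum_{a \in S} \sum_{j=2}^{\#_a(S)} \log (j - 1) \\
&= \log(m!) - \sum_{a \in S} \log(\#_a(S)!) + \sum_{a \in S} \log \#_a(S) \\
&\le \log(m!) - \sum_{a \in S} \log(\#_a(S)!) + n \log \frac{m}{n} \\
&= \log(m!) - \sum_{a \in S} \log(\#_a(S)!) + O(m) .
\end{align*}

By Stirling's Formula,

\[
x \log x - x / \ln 2 \le \log(x!) \le x \log x - x / \ln 2 + O(\log x) .
\]

Thus,

\[
C < m \log m - m / \ln 2 - \sum_{a \in S} \left( \#_a(S) \log \#_a(S) - \#_a(S) / \ln 2 \right) + O(m) .
\]

Since $\sum_{a \in S} \#_a(S) = m$,

\begin{align*}
C &< \sum_{a \in S} \#_a(S) \log \frac{m}{\#_a(S)} + O(m) \\
&= (H_0(S) + O(1))m .
\end{align*}

□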
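Two steps of the proof above are used without comment: the bound $\sum_{a \in S} \log \#_a(S) \le n \log (m/n)$ and the fact that $n \log (m/n) \in O(m)$. One way to see both, offered here as a supplementary sketch, is via the concavity of the logarithm (Jensen's inequality) and the maximum of $(\log x)/x$ at $x = e$:

```latex
\[
\sum_{a \in S} \log \#_a(S)
\;\le\; n \log \left( \frac{1}{n} \sum_{a \in S} \#_a(S) \right)
\;=\; n \log \frac{m}{n} ,
\qquad
n \log \frac{m}{n}
\;=\; m \cdot \frac{\log (m/n)}{m/n}
\;\le\; \frac{\log e}{e} \, m
\;<\; 0.54 \, m .
\]
```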



2.3 Using a Statistics Data Structure 

Let $T_1, \ldots, T_m$ be as defined in Subsection 2.2. Since Mehlhorn's algorithm takes
$O(n)$ time, sorting $S$ by constructing $T_1, \ldots, T_m$ takes $O(mn)$ time; this is
faster than BubbleSort, for example, but still impractical. To save time, we
implement all of the $T_i$'s as a single dynamic statistics data structure: an augmented
balanced binary search tree that stores a list of triples $(a_1, w_1, L_1), \ldots, (a_t, w_t, L_t)$,
each of which consists of a key $a_j$, a positive integer weight $w_j$ and a list $L_j$.
None of the following operations compares keys and each takes $O(\log t)$ time [3]:

search(b): return the smallest $j$ with $\sum_{k=1}^{j} w_k \ge b$;
sum(j): return $\sum_{k=1}^{j} w_k$;
triple(j): return $(a_j, w_j, L_j)$;
increment(j): increment $w_j$;
append(i, j): append $i$ to $L_j$;
insert(a, i, j): insert $(a, 1, (i))$ into the $j$th position in the list of triples.
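The interface above can be mocked up with an ordinary list. This is only a sketch of the operations' semantics: the class name `StatList` is hypothetical, positions are 1-based as in the text, and each operation here takes $O(t)$ time, not the $O(\log t)$ of an augmented balanced tree.

```python
class StatList:
    """List-backed stand-in for the statistics data structure."""

    def __init__(self):
        self.triples = []  # each entry is [key, weight, index_list]

    def search(self, b):
        """Smallest j whose prefix weight sum is >= b, or None if the total weight is < b."""
        running = 0
        for j, (_, w, _) in enumerate(self.triples, start=1):
            running += w
            if running >= b:
                return j
        return None

    def sum(self, j):
        return sum(w for _, w, _ in self.triples[:j])

    def triple(self, j):
        key, w, lst = self.triples[j - 1]
        return key, w, tuple(lst)

    def increment(self, j):
        self.triples[j - 1][1] += 1

    def append(self, i, j):
        self.triples[j - 1][2].append(i)

    def insert(self, a, i, j):
        self.triples.insert(j - 1, [a, 1, [i]])


d = StatList()
d.insert("T", 1, 1)   # s_1 = T
d.insert("O", 2, 1)   # s_2 = O goes before T
d.insert("R", 3, 2)   # s_3 = R goes between O and T
d.increment(1)        # s_4 = O again
d.append(4, 1)
print(d.triple(1))    # ('O', 2, (2, 4))
print(d.search(3))    # 2
```

None of the methods compares keys; the caller decides the position `j`.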

As an aside, we note there are faster statistics data structures on a word RAM;
we leave as future work investigating whether we can improve our
algorithm with one of them.

Lemma 3. Suppose we have a statistics data structure whose keys and weights
are, respectively, the distinct elements in $s_1, \ldots, s_i$ and their frequencies. Then
given the path from the root to a node $v$ in $T_i$, we can determine the following
in $O(\log n)$ time:

— if v is a leaf, the element stored at v; 

— whether v has a left child; 

— whether v has a right child; 

— if v has two children, the element stored at the rightmost leaf in v 's left 
subtree. 

Proof. Let $a_1, \ldots, a_t$ be the distinct elements in $s_1, \ldots, s_i$ in increasing order
and, for $1 \le j \le t$, let

\[
f_j = \sum_{k=1}^{j-1} \frac{\#_{a_k}(s_1, \ldots, s_i)}{i} + \frac{\#_{a_j}(s_1, \ldots, s_i)}{2i} .
\]

Given a binary string $p$, we can find the smallest $j$ such that $f_j$'s binary
representation begins $p$, if one exists: let $j'$ be the value returned by search$((.p)\,i)$
with $.p$ interpreted as a binary fraction; by $T_i$'s construction, we know the $j$
we seek is either $j'$ or $j' + 1$; we use sum($j'$), triple($j'$) and triple($j' + 1$) to
compute $f_{j'}$ and $f_{j'+1}$, if they are defined.

Let $\sigma$ be the path from the root to $v$ encoded as a binary string, with each 0
indicating a left edge and each 1 indicating a right edge. We can determine each
of the following properties of $v$ in $O(\log t) \subseteq O(\log n)$ time: if there is only one
$j$ such that $f_j$'s binary representation begins $\sigma$, then $v$ is a leaf storing $a_j$; if $v$
is an internal node and there is a $j$ such that $f_j$'s binary representation begins
$\sigma 0$, then $v$ has a left child; similarly, if $v$ is an internal node and there is a $j$ such
that $f_j$'s binary representation begins $\sigma 1$, then $v$ has a right child; finally, if $v$
has two children, then there is a $j$ such that $f_j$'s binary representation begins
$\sigma 0$ and $f_{j+1}$'s binary representation begins $\sigma 1$, and the rightmost leaf in $v$'s left
subtree stores $a_j$. □
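The first step of the proof, finding the smallest $j$ whose $f_j$ begins with a given prefix, can be sketched as below. The function name `smallest_with_prefix` is hypothetical, the weights are assumed to be a non-empty list of positive integers, and a linear scan over prefix sums stands in for the search and sum operations.

```python
from fractions import Fraction
from itertools import accumulate


def smallest_with_prefix(weights, p):
    """Smallest 1-based j such that f_j's binary expansion begins with p, else None."""
    total = sum(weights)
    cum = list(accumulate(weights))                   # cum[j-1] plays the role of sum(j)
    low = Fraction(int(p, 2), 2 ** len(p)) if p else Fraction(0)
    high = low + Fraction(1, 2 ** len(p))             # f begins with p iff low <= f < high
    # Stand-in for search((.p) * total): first j0 whose prefix sum reaches .p * total.
    j0 = next(j for j, c in enumerate(cum, start=1) if c >= low * total)
    for j in (j0, j0 + 1):                            # the answer is j0 or j0 + 1
        if j > len(weights):
            break
        f = Fraction(2 * cum[j - 1] - weights[j - 1], 2 * total)   # f_j
        if f >= low:
            return j if f < high else None
    return None


weights = [1, 3, 1, 2]                       # counts of N, O, R, T in TORONTO
print(smallest_with_prefix(weights, "01"))   # 2
print(smallest_with_prefix(weights, "1"))    # 3
print(smallest_with_prefix(weights, "011"))  # None
```

Checking the prefixes $\sigma$, $\sigma 0$ and $\sigma 1$ with such a routine is enough to answer the four questions in Lemma 3.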

For $2 \le i \le m$, let $a_1, \ldots, a_t$ be the distinct elements in $s_1, \ldots, s_{i-1}$.
Suppose we have a statistics data structure $D$ implementing $T_{i-1}$, that is, storing
$(a_1, \#_{a_1}(s_1, \ldots, s_{i-1}), L_1), \ldots, (a_t, \#_{a_t}(s_1, \ldots, s_{i-1}), L_t)$ with each $L_j$
containing the indices of $a_j$'s occurrences in $s_1, \ldots, s_{i-1}$. Using $D$ and Lemma 3, if
$s_i \in \{s_1, \ldots, s_{i-1}\}$, then searching for $s_i$ in $T_{i-1}$ takes
$\left\lceil \log \frac{i - 1}{\#_{s_i}(s_1, \ldots, s_{i-1})} \right\rceil + 1$
comparisons and
$O\!\left( \left( \log \frac{i - 1}{\#_{s_i}(s_1, \ldots, s_{i-1})} + 1 \right) \log n \right)$
time, and returns $j$ such that
$a_j = s_i$; otherwise, searching for $s_i$ takes at most $\lceil \log (i - 1) \rceil + 1$ comparisons
and $O((\log (i - 1) + 1) \log n)$ time, and returns $j$ such that $a_j$ is either $s_i$'s
predecessor or successor in $T_{i-1}$. Determining whether $a_j$ is $s_i$'s predecessor, $s_i$
itself, or $s_i$'s successor takes two more comparisons and $O(\log n)$ time. If $a_j$ is
$s_i$'s predecessor, then we use $O(\log n)$ time to insert $(s_i, 1, (i))$ into the $(j + 1)$st
position in the list of triples; if $a_j = s_i$, then we use $O(\log n)$ time to increment
the weight in the triple $(a_j, \#_{a_j}(s_1, \ldots, s_{i-1}), L_j)$ and append $i$ to $L_j$; if $a_j$ is
$s_i$'s successor, then we use $O(\log n)$ time to insert $(s_i, 1, (i))$ into the $j$th position
in the list of triples. After this, $D$ implements $T_i$.
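The update loop can be sketched end to end as follows. Note the substitution: this sketch locates each element with an ordinary binary search (`bisect_left`) over the sorted keys, not with the entropy-weighted LBST search, so it shows the bookkeeping only and does not achieve the comparison bound. The function names are hypothetical.

```python
from bisect import bisect_left


def process(keys, weights, lists, s, i):
    """Update the parallel lists (sorted keys, weights, index lists) with s = s_i."""
    pos = bisect_left(keys, s)          # stand-in for searching T_{i-1}
    if pos < len(keys) and keys[pos] == s:
        weights[pos] += 1               # increment(j)
        lists[pos].append(i)            # append(i, j)
    else:
        keys.insert(pos, s)             # insert(s_i, i, j)
        weights.insert(pos, 1)
        lists.insert(pos, [i])
    return pos + 1                      # 1-based position j of s_i's triple


def sort_indices(seq):
    keys, weights, lists = [], [], []
    for i, s in enumerate(seq, start=1):
        process(keys, weights, lists, s, i)
    return [i for lst in lists for i in lst]   # concatenation of the lists


print(sort_indices("TORONTO"))  # [5, 2, 4, 7, 3, 1, 6]
```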

We can construct a statistics data structure implementing $T_1$ in $O(1)$ time
without using any comparisons; by Lemmas 1 and 2, we can use $(H_0(S) + O(1))m$
comparisons and $O((H_0(S) + 1) m \log n)$ time to construct a statistics data
structure implementing $T_m$; from this we can obtain the concatenation of the lists
in $T_m$ in $O(m \log n)$ time. Therefore, we can sort $S$ using $(H_0(S) + O(1))m$
comparisons and $O((H_0(S) + 1) m \log n)$ time.



3 Sorting S using $(H_\ell(S) + O(1))m$ Comparisons

To generalize our algorithm, given $S$ and $\ell$ with $n^{\ell+1} \log n \in O(m)$, we work
from left to right and maintain a set of statistics data structures, one for each
distinct $\ell$-tuple seen so far, and keep track of them using two dictionaries. In
effect, we partition $S$, use the statistics data structures to sort each of the parts,
and then merge them. This uses a total of $(H_\ell(S) + O(1))m$ comparisons and
$O((H_\ell(S) + 1) m \log n + \ell m)$ time.



3.1 Using a Set of Statistics Data Structures 

As we work, we maintain a statistics data structure $D_\alpha$ for each distinct $\ell$-tuple
$\alpha$ that has occurred so far. Assume we have a black box $B$ that works as follows:
for $\ell + 1 \le i \le m$, suppose we query $B$ immediately before we process $s_i$;
if the $\ell$-tuple $s_{i-\ell}, \ldots, s_{i-1}$ has occurred before, then $B$ returns a pointer to
$D_{s_{i-\ell}, \ldots, s_{i-1}}$; otherwise, $B$ creates $D_{s_{i-\ell}, \ldots, s_{i-1}}$ and returns a pointer to it;
in both cases, querying $B$ costs $O(\log n)$ comparisons and $O(\ell \log n)$ time.

We use $B$ to keep track of the statistics data structures, but we only query
it after seeing a new distinct $(\ell + 1)$-tuple; this way, the total cost of querying
$B$ is $O(n^{\ell+1} \log n) \subseteq O(m)$ comparisons and $O(n^{\ell+1} \ell \log n) \subseteq O(\ell m)$ time. We
augment the statistics data structures so that, instead of storing just triples, they
store quadruples: each $D_{b_1, \ldots, b_\ell}$ stores a list of quadruples $(a_1, w_1, L_1, p_1), \ldots,$
$(a_t, w_t, L_t, p_t)$, where $p_j$ is a pointer to $D_{b_2, \ldots, b_\ell, a_j}$; as well, $D_{b_1, \ldots, b_\ell}$ stores the
ranks of $b_2, \ldots, b_\ell$, which we define and use later.

To process $S$, first, we query $B$ to obtain a pointer to a new statistics data
structure $D_{s_1, \ldots, s_\ell}$. For $\ell + 1 \le i \le m$, we search for $s_i$ in the LBST that
$D_{s_{i-\ell}, \ldots, s_{i-1}}$ implements, as in Subsection 2.3. If $s_{i-\ell}, \ldots, s_i$ has occurred
before, then this search returns a quadruple $(s_i, w, L, p)$; we increment $w$, append
$i$ to $L$, and retrieve $p$, which points to $D_{s_{i-\ell+1}, \ldots, s_i}$. If $s_{i-\ell}, \ldots, s_i$ has not
occurred before, then we query $B$ to obtain a pointer $p$ to $D_{s_{i-\ell+1}, \ldots, s_i}$ and insert
$(s_i, 1, (i), p)$ into $D_{s_{i-\ell}, \ldots, s_{i-1}}$.

As in Section 1, let $A_\ell$ be the set of $\ell$-tuples in $S$ and, for $\alpha \in A_\ell$, let $S_\alpha$
be the sequence whose $j$th element is the one immediately following the $j$th
occurrence of $\alpha$ in $S$. In total, processing $S$ takes

\[
O(m) + \sum_{\alpha \in A_\ell} (H_0(S_\alpha) + O(1)) |S_\alpha| = (H_\ell(S) + O(1))m
\]

comparisons and

\[
O(\ell m) + \sum_{\alpha \in A_\ell} O\left( (H_0(S_\alpha) + 1) |S_\alpha| \log n \right) = O((H_\ell(S) + 1) m \log n + \ell m)
\]

time. When we finish processing $S$, for each $\alpha \in A_\ell$ and each $a \in S_\alpha$, $D_\alpha$ contains
a quadruple $(a, \#_a(S_\alpha), L, p)$ with $L$ containing the indices of occurrences of $a$
that immediately follow occurrences of $\alpha$ in $S$.

After we process $S$, the at most $n^\ell$ statistics data structures each contain at
most $n$ quadruples. Consider all these quadruples, as well as "dummy" quadruples
$(s_1, 1, (1), \mathrm{null}), \ldots, (s_\ell, 1, (\ell), \mathrm{null})$; we sort them all by their keys, which
takes $O((n^{\ell+1} + \ell) \log n) \subseteq O(m)$ comparisons and time. Now consider the
concatenation of their lists of indices: its inverse sorts $S$. To see why, notice the
indices are in non-decreasing order by the elements they index. Thus, to prove
the following theorem, it only remains for us to implement our black box $B$.
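The partition-and-merge structure of this section can be sketched compactly. The sketch makes several substitutions that should be read as assumptions: a Python dictionary keyed by context tuples replaces both the black box $B$ and the pointers stored in quadruples, a nested dictionary replaces each statistics data structure, and the built-in sort replaces the final merge. It shows why the concatenated index lists sort $S$, not how the comparison bound is met; the function name is hypothetical.

```python
def sort_with_contexts(seq, ell):
    """Return 1-based indices of seq in non-decreasing order of the elements they index."""
    structures = {}                                  # context tuple -> {element: [indices]}
    for i in range(ell + 1, len(seq) + 1):           # process s_i for i = ell+1, ..., m
        context = tuple(seq[i - 1 - ell:i - 1])      # s_{i-ell}, ..., s_{i-1}
        structures.setdefault(context, {}).setdefault(seq[i - 1], []).append(i)
    # "Dummy" entries for the first ell elements, which have no full context.
    entries = [(seq[i - 1], [i]) for i in range(1, min(ell, len(seq)) + 1)]
    for per_context in structures.values():
        entries.extend(per_context.items())
    entries.sort(key=lambda entry: entry[0])         # sort all entries by key only
    return [i for _, indices in entries for i in indices]


s = "TORONTO"
perm = sort_with_contexts(s, 1)
print("".join(s[i - 1] for i in perm))  # NOOORTT
```

Entries with equal keys may come from different contexts, so this output is sorted but not necessarily stable.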

Theorem 2. Given a sequence $S = s_1, \ldots, s_m$ containing $n$ distinct elements
and a non-negative integer $\ell$ with $n^{\ell+1} \log n \in O(m)$, we can sort $S$ using
$(H_\ell(S) + O(1))m$ comparisons and $O((H_\ell(S) + 1) m \log n + \ell m)$ time.

3.2 Using a Dictionary of Elements and a Dictionary of $\ell$-tuples

For our black box $B$, we use two dictionaries, both implemented as balanced
binary search trees: $B_1$ contains the at most $n$ distinct elements seen so far, and
$B_2$ stores $O(\log m)$-bit encodings of the at most $n^\ell$ distinct $\ell$-tuples seen so far.
We search in each dictionary once per query to $B$, that is, once per distinct
$(\ell + 1)$-tuple in $S$; we use a total of $O(n^{\ell+1} \log n) \subseteq O(m)$ comparisons and time
searching in $B_1$; we use a total of $O(n^{\ell+1} \ell \log n) \subseteq O(\ell m)$ time searching in $B_2$,
but no comparisons between elements of $S$.

We maintain the invariant that, immediately before we process $s_i$, $B_1$ stores
a set of pairs $(a_1, r_1), \ldots, (a_t, r_t)$, each of which consists of a distinct element
$a_j \in \{s_1, \ldots, s_{i-1}\}$ and $a_j$'s rank $r_j$. We say $a_j$ has rank $r_j$ if it is the $r_j$th distinct
element to appear in $S$; that is, for some $k$, the first occurrence of $a_j$ is $s_k$ and
there are $r_j$ distinct elements in $s_1, \ldots, s_k$. Notice operations on $B_1$ use $O(\log n)$
comparisons and time. To query our black box $B$ before processing $s_i$, we start
by searching for $s_i$ in $B_1$. If this search succeeds, then we retrieve $s_i$'s rank; if
it fails, then $s_i$ is a new distinct element and we insert $(s_i, r)$, where $r$ is the
number of distinct elements seen so far, including $s_i$. After this we have the
ranks $r_1, \ldots, r_{\ell-1}$ of $s_{i-\ell+1}, \ldots, s_{i-1}$, which $D_{s_{i-\ell}, \ldots, s_{i-1}}$ stores, and the rank $r_\ell$
of $s_i$.
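The rank of an element, as defined above, depends only on the order of first appearances. A minimal sketch, with a Python dictionary assumed in place of the balanced search tree $B_1$ and a hypothetical function name:

```python
def first_appearance_ranks(seq):
    """Map each distinct element to r if it is the r-th distinct element to appear."""
    ranks = {}
    for x in seq:
        if x not in ranks:
            ranks[x] = len(ranks) + 1   # number of distinct elements seen so far, including x
    return ranks


print(first_appearance_ranks("TORONTO"))  # {'T': 1, 'O': 2, 'R': 3, 'N': 4}
```

Ranks are integers between 1 and $n$, so tuples of ranks can be compared without comparing elements of $S$.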

We also maintain the invariant that, immediately before we process $s_i$, $B_2$
stores a set of pairs $(g_1, p_1), \ldots, (g_{t'}, p_{t'})$, each of which consists of an $O(\log m)$-bit
encoding $g_j$ of a distinct $\ell$-tuple $\alpha$ in $s_1, \ldots, s_{i-1}$ and a pointer $p_j$ to $D_\alpha$.
We use the gamma code 0] to encode the rank of each $a \in \alpha$; the gamma code encodes any positive
integer $x$ as a binary string $\gamma(x)$ consisting of $\lfloor \log x \rfloor$ copies of 0 followed by the
$(\lfloor \log x \rfloor + 1)$-bit binary representation of $x$.

Lemma 4. We can encode any sequence $x_1, \ldots, x_\ell$ of positive $O(\log n)$-bit
integers as a unique $O(\log m)$-bit binary string.

Proof. The gamma code is prefix-free and, hence, unambiguous: any binary
string is the concatenation of at most one sequence of encoded integers. Thus,
the encoding $\gamma(x_1) \cdots \gamma(x_\ell)$ is unique and has length $O(\ell \log n) \subseteq O(\log m)$. □
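The encoding and its unique decodability can be illustrated with bit strings. This is a sketch for exposition with hypothetical function names; a real implementation would pack the bits into machine words.

```python
def gamma(x):
    """Gamma code of a positive integer x as a string of 0s and 1s."""
    if x < 1:
        raise ValueError("gamma code is defined for positive integers only")
    binary = bin(x)[2:]                       # the (floor(log x) + 1)-bit representation
    return "0" * (len(binary) - 1) + binary   # preceded by floor(log x) zeros


def encode(xs):
    return "".join(gamma(x) for x in xs)


def decode(bits):
    """Recover the unique sequence of positive integers whose codes concatenate to bits."""
    xs, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos + zeros] == "0":       # count the leading zeros of the next codeword
            zeros += 1
        xs.append(int(bits[pos + zeros:pos + 2 * zeros + 1], 2))
        pos += 2 * zeros + 1
    return xs


print(gamma(5))           # 00101
print(encode([1, 2, 5]))  # 101000101
print(decode("101000101"))  # [1, 2, 5]
```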

Notice operations on $B_2$ use $O(\ell \log n)$ time and comparisons between
encodings of $\ell$-tuples, but no comparisons between elements of $S$. To query our
black box $B$, after searching for $s_i$ in $B_1$, we search for $\gamma(r_1) \cdots \gamma(r_\ell)$ in $B_2$. If
this search succeeds, then we retrieve a pointer to $D_{s_{i-\ell+1}, \ldots, s_i}$; if it fails, then
$s_{i-\ell+1}, \ldots, s_i$ is a new distinct $\ell$-tuple, so we create a new statistics data
structure $D_{s_{i-\ell+1}, \ldots, s_i}$ and insert $(\gamma(r_1) \cdots \gamma(r_\ell), p)$ into $B_2$, where $p$ is a pointer to
$D_{s_{i-\ell+1}, \ldots, s_i}$.